Course [IT790A] Explainable AI
Name: Abed Bakkour
Username: a22abeal
import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
import xgboost as xgb
import shap
import time
import warnings
df = pd.read_csv('../data/day-bikesharing.csv',parse_dates=['dteday'],index_col=1) # Load the data
def add_features(df):
    df['year'] = df.index.year       # 'yr' (0/1 encoded) exists already
    df['month'] = df.index.month
    df['day'] = df.index.day
    df['dayofweek'] = df.index.dayofweek
Note: 2011-01-01 was a Saturday, and pandas' 'dayofweek' encodes Saturday as 5 (Monday = 0, Sunday = 6).
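A quick sanity check on that mapping (a minimal sketch; pandas encodes Monday as 0 through Sunday as 6):

```python
import pandas as pd

# pandas' dayofweek: Monday=0 ... Saturday=5, Sunday=6
d = pd.Timestamp('2011-01-01')
print(d.day_name(), d.dayofweek)  # Saturday 5
```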
add_features(df)
df.head(3)
| instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | year | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dteday | |||||||||||||||||||
| 2011-01-01 | 1 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 | 2011 | 1 | 1 | 5 |
| 2011-01-02 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 | 2011 | 1 | 2 | 6 |
| 2011-01-03 | 3 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 | 2011 | 1 | 3 | 0 |
# Let's use 'dayofweek' instead of 'weekday'.
plt.figure(figsize=(16,3))
df['weekday'][:7].plot(linestyle='--', marker='o', color='k', ms=20,
title='weekday feature (confusing with \'6\' and \'0\' for weekends)')
plt.show()
plt.figure(figsize=(16,3))
df['dayofweek'][:14].plot(linestyle='--', marker='o', color='r', ms=20,
title='dayofweek feature')
plt.show()
# The 'day' feature is the day of the month (1 to 30/31), i.e. whether a date falls near the start or the end of the month.
plt.figure(figsize=(16,3))
df['day'][:33].plot(linestyle='--', marker='o', color='c', ms=20, title='day feature')
plt.show()
plt.figure(figsize=(16,3))
df['workingday'][:7].plot(linestyle='--', marker='o', color='b', ms=20, title='workingday feature')
plt.show()
plt.figure(figsize=(16,5))
df['season'].plot()
plt.show()
df.season.value_counts()
3    188
2    184
1    181
4    178
Name: season, dtype: int64
plt.figure(figsize=(16,5))
df['cnt'].plot()
plt.show()
plt.scatter(x=df.temp*41,y=df["cnt"]) # denormalize temperature by multiplying by 41 (the dataset's max)
plt.grid(True)
plt.xlabel('Temperature')
plt.ylabel('Rental Count')
plt.title('Temperature vs. Rental Count')
plt.show()
# From January to April, demand is relatively low due to winter
# (few people are brave enough to bike on icy roads).
# Demand peaks around summer (roughly May to October) and then drops again.
group_month = df.groupby(['month'])
average_by_month = group_month['cnt'].mean()
plt.plot(average_by_month.index,average_by_month)
plt.xlabel('Month')
plt.ylabel('Count')
plt.xticks(np.arange(13))
plt.grid(True)
plt.title('Average Rental Count by Month')
#correlation of total cnt with other variables
plt.figure(figsize=(16,5))
df.corr()['cnt'].plot(kind='bar')
plt.show()
# These columns are the normalized versions:
# temp: values divided by 41 (max)
# atemp: values divided by 50 (max)
# hum: values divided by 100 (max)
# windspeed: values divided by 67 (max)
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| instant | 731.0 | 366.000000 | 211.165812 | 1.000000 | 183.500000 | 366.000000 | 548.500000 | 731.000000 |
| season | 731.0 | 2.496580 | 1.110807 | 1.000000 | 2.000000 | 3.000000 | 3.000000 | 4.000000 |
| yr | 731.0 | 0.500684 | 0.500342 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| mnth | 731.0 | 6.519836 | 3.451913 | 1.000000 | 4.000000 | 7.000000 | 10.000000 | 12.000000 |
| holiday | 731.0 | 0.028728 | 0.167155 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| weekday | 731.0 | 2.997264 | 2.004787 | 0.000000 | 1.000000 | 3.000000 | 5.000000 | 6.000000 |
| workingday | 731.0 | 0.683995 | 0.465233 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| weathersit | 731.0 | 1.395349 | 0.544894 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 |
| temp | 731.0 | 0.495385 | 0.183051 | 0.059130 | 0.337083 | 0.498333 | 0.655417 | 0.861667 |
| atemp | 731.0 | 0.474354 | 0.162961 | 0.079070 | 0.337842 | 0.486733 | 0.608602 | 0.840896 |
| hum | 731.0 | 0.627894 | 0.142429 | 0.000000 | 0.520000 | 0.626667 | 0.730209 | 0.972500 |
| windspeed | 731.0 | 0.190486 | 0.077498 | 0.022392 | 0.134950 | 0.180975 | 0.233214 | 0.507463 |
| casual | 731.0 | 848.176471 | 686.622488 | 2.000000 | 315.500000 | 713.000000 | 1096.000000 | 3410.000000 |
| registered | 731.0 | 3656.172367 | 1560.256377 | 20.000000 | 2497.000000 | 3662.000000 | 4776.500000 | 6946.000000 |
| cnt | 731.0 | 4504.348837 | 1937.211452 | 22.000000 | 3152.000000 | 4548.000000 | 5956.000000 | 8714.000000 |
| year | 731.0 | 2011.500684 | 0.500342 | 2011.000000 | 2011.000000 | 2012.000000 | 2012.000000 | 2012.000000 |
| month | 731.0 | 6.519836 | 3.451913 | 1.000000 | 4.000000 | 7.000000 | 10.000000 | 12.000000 |
| day | 731.0 | 15.738714 | 8.809949 | 1.000000 | 8.000000 | 16.000000 | 23.000000 | 31.000000 |
| dayofweek | 731.0 | 3.002736 | 2.004787 | 0.000000 | 1.000000 | 3.000000 | 5.000000 | 6.000000 |
df.dtypes
instant         int64
season          int64
yr              int64
mnth            int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
year            int64
month           int64
day             int64
dayofweek       int64
dtype: object
Perform one-hot encoding on the categorical variables. Most machine learning algorithms assume (and require) numeric input, so categorical data must be pre-processed into numerical form. Each categorical variable is turned into a set of binary features that are 'one hot' encoded: a row receives a 1 in the column for its category and a 0 everywhere else.
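As a toy illustration of the idea (not the assignment data; the tiny array below is made up), a single column with three category levels expands into three binary columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with levels 1, 2, 3
seasons = np.array([[1], [2], [3], [2]])
onehot = OneHotEncoder(categories='auto').fit_transform(seasons).toarray()
# Each row has exactly one 1, in the column matching its category:
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
print(onehot)
```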
# Define the categorical variables
cat_var = ['season', 'holiday', 'workingday', 'weathersit', 'year']
# OneHotEncoder
ohe = OneHotEncoder(categories = 'auto')
# Fit the categorical variables to the encoder
encodeddf = ohe.fit_transform(df[cat_var])
# Create a DataFrame with the encoded value information
cat_df = pd.DataFrame(data = encodeddf.toarray(), columns = ohe.get_feature_names_out(cat_var))
cat_df
| season_1 | season_2 | season_3 | season_4 | holiday_0 | holiday_1 | workingday_0 | workingday_1 | weathersit_1 | weathersit_2 | weathersit_3 | year_2011 | year_2012 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 726 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 727 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 728 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 729 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 730 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
731 rows × 13 columns
# rename the season columns from 1,2,3,4 to winter, spring, summer, fall
cat_df.rename(columns={'season_1': 'winter', 'season_2': 'spring', 'season_3': 'summer', 'season_4': 'fall'}, inplace=True)
cat_df.rename(columns={'weathersit_1': 'clear', 'weathersit_2': 'cloudy', 'weathersit_3': 'lightsnow.rain'}, inplace=True)
cat_df
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 726 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 727 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 728 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 729 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 730 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
731 rows × 13 columns
Many machine learning algorithms are sensitive to the relative scales of numerical variables. The ranges of all variables need to be normalized so that each feature contributes approximately proportionately to the model's calculations.
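A minimal sketch of what `StandardScaler` does (toy numbers, not the bike data): it subtracts the column mean and divides by the column standard deviation, and `inverse_transform` undoes it exactly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler()
z = scaler.fit_transform(x)           # zero mean, unit variance (population std)
x_back = scaler.inverse_transform(z)  # recovers the original values
```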
# Define the numerical variables
num_var = ['temp', 'atemp', 'hum', 'windspeed', 'month', 'day', 'dayofweek']
Standardize the numerical variables (zero mean, unit variance)
# StandardScaler object
scaler = StandardScaler()
# Fit the data to the scaler
numscaled = scaler.fit_transform(df[num_var])
Keep the original (unscaled) values for later inspection
# for inverse transformation
inversed = scaler.inverse_transform(numscaled)
print(inversed)
[[ 0.344167  0.363625  0.805833 ...  1.        1.        5.      ]
 [ 0.363478  0.353739  0.696087 ...  1.        2.        6.      ]
 [ 0.196364  0.189405  0.437273 ...  1.        3.        0.      ]
 ...
 [ 0.253333  0.2424    0.752917 ... 12.       29.        5.      ]
 [ 0.255833  0.2317    0.483333 ... 12.       30.        6.      ]
 [ 0.215833  0.223487  0.5775   ... 12.       31.        0.      ]]
# Create DataFrame with original data
inversed_df = pd.DataFrame(data = inversed, columns = num_var)
inversed_df
# Use the UCI dataset info to recover the actual temperature and weather values:
# temp: normalized temperature in Celsius; values divided by 41 (max)
# atemp: normalized feeling temperature in Celsius; values divided by 50 (max)
# hum: normalized humidity; values divided by 100 (max)
# windspeed: normalized wind speed; values divided by 67 (max)
inversed_origin = pd.concat([inversed_df.iloc[:,0] * 41, #temperature
inversed_df.iloc[:,1] * 50, #feeling temperature
inversed_df.iloc[:,2] * 100, #humidity
inversed_df.iloc[:,3] * 67], #wind speed
axis=1)
inversed_origin
| temp | atemp | hum | windspeed | |
|---|---|---|---|---|
| 0 | 14.110847 | 18.18125 | 80.5833 | 10.749882 |
| 1 | 14.902598 | 17.68695 | 69.6087 | 16.652113 |
| 2 | 8.050924 | 9.47025 | 43.7273 | 16.636703 |
| 3 | 8.200000 | 10.60610 | 59.0435 | 10.739832 |
| 4 | 9.305237 | 11.46350 | 43.6957 | 12.522300 |
| ... | ... | ... | ... | ... |
| 726 | 10.420847 | 11.33210 | 65.2917 | 23.458911 |
| 727 | 10.386653 | 12.75230 | 59.0000 | 10.416557 |
| 728 | 10.386653 | 12.12000 | 75.2917 | 8.333661 |
| 729 | 10.489153 | 11.58500 | 48.3333 | 23.500518 |
| 730 | 8.849153 | 11.17435 | 57.7500 | 10.374682 |
731 rows × 4 columns
# timeseries of temp, atemp, humidity, windspeed
ax = inversed_origin.set_index(inversed_origin.index).plot(figsize=(20,5))
ax.set_xlabel('daily time (# rows)')
X_original = pd.concat([cat_df, inversed_origin, inversed_df[['month', 'day', 'dayofweek']]], axis=1)
X_original # Use this for SHAP plot view
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 1.0 | 1.0 | 5.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 1.0 | 2.0 | 6.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 1.0 | 3.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 8.200000 | 10.60610 | 59.0435 | 10.739832 | 1.0 | 4.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.305237 | 11.46350 | 43.6957 | 12.522300 | 1.0 | 5.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 726 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 10.420847 | 11.33210 | 65.2917 | 23.458911 | 12.0 | 27.0 | 3.0 |
| 727 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 10.386653 | 12.75230 | 59.0000 | 10.416557 | 12.0 | 28.0 | 4.0 |
| 728 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 10.386653 | 12.12000 | 75.2917 | 8.333661 | 12.0 | 29.0 | 5.0 |
| 729 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 10.489153 | 11.58500 | 48.3333 | 23.500518 | 12.0 | 30.0 | 6.0 |
| 730 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 8.849153 | 11.17435 | 57.7500 | 10.374682 | 12.0 | 31.0 | 0.0 |
731 rows × 20 columns
# Create DataFrame with numerically scaled data
num_df = pd.DataFrame(data = numscaled, columns = num_var)
num_df
| temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|
| 0 | -0.826662 | -0.679946 | 1.250171 | -0.387892 | -1.600161 | -1.674108 | 0.996930 |
| 1 | -0.721095 | -0.740652 | 0.479113 | 0.749602 | -1.600161 | -1.560522 | 1.496077 |
| 2 | -1.634657 | -1.749767 | -1.339274 | 0.746632 | -1.600161 | -1.446936 | -1.498809 |
| 3 | -1.614780 | -1.610270 | -0.263182 | -0.389829 | -1.600161 | -1.333351 | -0.999661 |
| 4 | -1.467414 | -1.504971 | -1.341494 | -0.046307 | -1.600161 | -1.219765 | -0.500513 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 726 | -1.318665 | -1.521108 | 0.175807 | 2.061426 | 1.588660 | 1.279122 | -0.001366 |
| 727 | -1.323224 | -1.346690 | -0.266238 | -0.452131 | 1.588660 | 1.392707 | 0.497782 |
| 728 | -1.323224 | -1.424344 | 0.878392 | -0.853552 | 1.588660 | 1.506293 | 0.996930 |
| 729 | -1.309558 | -1.490049 | -1.015664 | 2.069444 | 1.588660 | 1.619879 | 1.496077 |
| 730 | -1.528225 | -1.540482 | -0.354061 | -0.460201 | 1.588660 | 1.733465 | -1.498809 |
731 rows × 7 columns
Define X (input): concatenate the one-hot encoded categorical data with the standardized numerical data
Xdf = pd.concat([cat_df, num_df], axis = 1)
print(Xdf.shape)
Xdf.head(5)
(731, 20)
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | -0.826662 | -0.679946 | 1.250171 | -0.387892 | -1.600161 | -1.674108 | 0.996930 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | -0.721095 | -0.740652 | 0.479113 | 0.749602 | -1.600161 | -1.560522 | 1.496077 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.634657 | -1.749767 | -1.339274 | 0.746632 | -1.600161 | -1.446936 | -1.498809 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.614780 | -1.610270 | -0.263182 | -0.389829 | -1.600161 | -1.333351 | -0.999661 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.467414 | -1.504971 | -1.341494 | -0.046307 | -1.600161 | -1.219765 | -0.500513 |
# Save column names for SHAP usage
NAMES = Xdf.columns
NAMES
Index(['winter', 'spring', 'summer', 'fall', 'holiday_0', 'holiday_1',
'workingday_0', 'workingday_1', 'clear', 'cloudy', 'lightsnow.rain',
'year_2011', 'year_2012', 'temp', 'atemp', 'hum', 'windspeed', 'month',
'day', 'dayofweek'],
dtype='object')
plt.subplots(figsize=(8, 6))
sb.heatmap(Xdf.corr(), cmap='coolwarm')
# Check for missing (NA) values
Xdf.isna().any().sum()
0
Define Y (to predict): Extract target
# Indicate which target to predict ('cnt', 'casual', or 'registered').
# For the separate models per customer type in the assignment, change this target.
to_predict = 'cnt'
y = df[to_predict]
# Normalize target data: Calculate mean and standard deviation
y_mu = y.mean(0)
y_sd = y.std(0)
print('y mean:', y_mu, ' y std:' , y_sd)
y_norm = (y - y_mu) / y_sd
y_norm = y_norm.values.ravel() #dataframe to 1D array
#y_norm
y mean: 4504.3488372093025 y std: 1937.2114516187676
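The same normalization can later be inverted to turn model outputs back into rental counts; a small sketch with made-up numbers (note that `pd.Series.std` defaults to the sample standard deviation, ddof=1):

```python
import numpy as np

y = np.array([985.0, 801.0, 1349.0])  # toy counts
mu, sd = y.mean(), y.std(ddof=1)      # match pandas' default std
y_norm = (y - mu) / sd                # normalized target
y_back = y_norm * sd + mu             # inverse: back to counts
```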
len(y)
731
# To verify, keep a full DataFrame with X and y together
tdf = pd.concat([Xdf, pd.DataFrame(y.values)], axis = 1)
tdf.rename(columns={0:'cnt'}, inplace=True) #change the column name
tdf
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | ... | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | -0.826662 | -0.679946 | 1.250171 | -0.387892 | -1.600161 | -1.674108 | 0.996930 | 985 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | -0.721095 | -0.740652 | 0.479113 | 0.749602 | -1.600161 | -1.560522 | 1.496077 | 801 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | -1.634657 | -1.749767 | -1.339274 | 0.746632 | -1.600161 | -1.446936 | -1.498809 | 1349 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | -1.614780 | -1.610270 | -0.263182 | -0.389829 | -1.600161 | -1.333351 | -0.999661 | 1562 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | -1.467414 | -1.504971 | -1.341494 | -0.046307 | -1.600161 | -1.219765 | -0.500513 | 1600 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 726 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | -1.318665 | -1.521108 | 0.175807 | 2.061426 | 1.588660 | 1.279122 | -0.001366 | 2114 |
| 727 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | -1.323224 | -1.346690 | -0.266238 | -0.452131 | 1.588660 | 1.392707 | 0.497782 | 3095 |
| 728 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | -1.323224 | -1.424344 | 0.878392 | -0.853552 | 1.588660 | 1.506293 | 0.996930 | 1341 |
| 729 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | -1.309558 | -1.490049 | -1.015664 | 2.069444 | 1.588660 | 1.619879 | 1.496077 | 1796 |
| 730 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | -1.528225 | -1.540482 | -0.354061 | -0.460201 | 1.588660 | 1.733465 | -1.498809 | 2729 |
731 rows × 21 columns
Split into training, validation, and test data
# Split the data into training, validation, and test data:
X_train, X_test, y_train, y_test = train_test_split(Xdf, y_norm, test_size = 0.2, random_state=1)
# Validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
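With two chained `train_test_split` calls at `test_size=0.2`, the proportions come out to roughly 64% train / 16% validation / 20% test; a toy check with 100 hypothetical samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)
# First split off the test set, then split validation off the remainder
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=1)
print(len(X_tr), len(X_va), len(X_te))  # 64 16 20
```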
# A parameter grid for XGBoost
params = {
'min_child_weight': [1, 5, 10],
'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'max_depth': [3, 4, 5]
}
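The grid above has 3 x 5 x 3 x 3 x 3 = 405 parameter combinations, so 3-fold cross-validation runs 1215 model fits; a quick check of that arithmetic:

```python
import math

params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}
n_candidates = math.prod(len(v) for v in params.values())
n_fits = n_candidates * 3  # cv=3
print(n_candidates, n_fits)  # 405 1215
```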
# Build the model.
# XGBoost training parameter references:
# https://github.com/dmlc/xgboost
# https://xgboost.readthedocs.io/en/latest/parameter.html
# XGBRegressor object and GridSearch object
xgbmodel = xgb.XGBRegressor()
clf = GridSearchCV(estimator = xgbmodel,
param_grid = params,
cv = 3,
verbose=1)
# Fit the training data to the GridSearch object
t0 = time.time()
clf.fit(X_train, y_train)
t1 = time.time()
print(t1-t0, " sec taken")
Fitting 3 folds for each of 405 candidates, totalling 1215 fits
82.36391687393188  sec taken
# Display the best parameters from the GridSearch
# Let's assume that this is a good enough model to use for SHAP
print(clf.best_params_)
print(clf.best_score_)
{'colsample_bytree': 0.8, 'gamma': 0.5, 'max_depth': 4, 'min_child_weight': 5, 'subsample': 0.8}
0.8633448414950647
from sklearn._config import get_config, set_config
set_config(print_changed_only=False)
xgbmodel = xgb.XGBRegressor(**clf.best_params_, random_state=0)
# Fit the training data to the model
xgbmodel.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_val, y_val)])
[0]   validation_0-rmse:0.84811  validation_1-rmse:0.86342
[1]   validation_0-rmse:0.67246  validation_1-rmse:0.69370
[2]   validation_0-rmse:0.54534  validation_1-rmse:0.58267
...
[98]  validation_0-rmse:0.22168  validation_1-rmse:0.30431
[99]  validation_0-rmse:0.22167  validation_1-rmse:0.30431
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.8, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=0.5, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=4, max_leaves=None,
             min_child_weight=5, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             objective='reg:squarederror', predictor=None, ...)
eval_result = xgbmodel.evals_result()
training_rounds = range(len(eval_result['validation_0']['rmse'])) #training
print(training_rounds)
range(0, 100)
plt.figure(figsize=(8,4))
plt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='Training Error')
plt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='Validation Error')
plt.grid(True)
plt.xlabel('Iteration')
plt.ylabel('RMSE')
plt.title('Training vs. Validation Error')
plt.legend()
plt.show()
ypredicted = xgbmodel.predict(X_test)
plt.figure(figsize=(10,3))
# Actual Vs Predicted
plt.plot(y_test, label='Actual')
plt.plot(ypredicted, label='Predicted')
plt.xlabel('Sample')
plt.ylabel('Count')
plt.xlim([100,150])
plt.title('Test Dataset - Predicted vs. Actual')
plt.legend()
plt.show()
# Over-prediction and under-prediction should be balanced (residuals centered around zero)
# Test data residuals
residuals = (y_test - ypredicted) # * y_sd + y_mu (for actual cnt numbers)
plt.figure(figsize=(10,3))
plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('Residuals Distribution')
plt.axvline(color='r')
plt.show()
print("MSE: {0:0.4f}".format(mean_squared_error(y_test, ypredicted)))
print("RMSE: {0:0.4f}".format(np.sqrt(mean_squared_error(y_test, ypredicted))))
MSE: 0.0884
RMSE: 0.2973
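To be explicit about the distinction (toy numbers, not the test set): MSE averages the squared errors, while RMSE is its square root and therefore in the same units as the (normalized) target:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
mse = np.mean((y_true - y_pred) ** 2)  # (0.01 + 0.01 + 0.04) / 3 = 0.02
rmse = np.sqrt(mse)                    # ~0.1414
```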
# Show the influencing features in order (with feature_importances_)
# Indicate which parameter used to decide the ranking of important features
important_features = pd.Series(data=xgbmodel.feature_importances_, index=NAMES)
important_features.sort_values(ascending=False,inplace=True)
print(important_features)
# There are different importance types:
# Refer to https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py
# * 'weight': the number of times a feature is used to split the data across all trees.
# * 'gain': the average gain across all splits the feature is used in.
# * 'cover': the average coverage across all splits the feature is used in.
# * 'total_gain': the total gain across all splits the feature is used in.
# * 'total_cover': the total coverage across all splits the feature is used in.
year_2012         0.334441
month             0.148238
year_2011         0.093973
atemp             0.091646
winter            0.060549
fall              0.057606
clear             0.034687
cloudy            0.033488
temp              0.031691
lightsnow.rain    0.028922
hum               0.022926
windspeed         0.017095
workingday_0      0.010901
holiday_0         0.009047
day               0.007691
dayofweek         0.007369
workingday_1      0.007166
spring            0.002563
holiday_1         0.000000
summer            0.000000
dtype: float32
# Rank by 'gain'; features never used in any split (e.g. summer, holiday_1) get NaN here
important_features = pd.Series(data=xgbmodel.get_booster().get_score(importance_type='gain'), index=NAMES)
important_features.sort_values(ascending=False,inplace=True)
print(important_features)
year_2012         24.965616
month             11.065815
year_2011          7.015013
atemp              6.841238
winter             4.519948
fall               4.300218
clear              2.589336
cloudy             2.499859
temp               2.365730
lightsnow.rain     2.159016
hum                1.711424
windspeed          1.276122
workingday_0       0.813745
holiday_0          0.675331
day                0.574106
dayofweek          0.550102
workingday_1       0.534939
spring             0.191319
summer                  NaN
holiday_1               NaN
dtype: float64
# Show the influencing features in order (with get_booster().get_fscore())
# fscore is based on the number of times a feature is used in a tree
fscore = xgbmodel.get_booster().get_fscore().items()
lists = sorted(fscore, key=lambda x:x[1]) # sorted by value, return a list of tuples
fkey, fvalue = zip(*lists) # unpack a list of pairs into two tuples
plt.barh(fkey, fvalue)
plt.show()
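The sorting idiom above works on any `{feature: count}` dict; a tiny standalone version with made-up split counts:

```python
# get_fscore() returns a dict of {feature_name: number_of_splits}.
# Toy dict (made-up counts) to show the sort-by-value idiom:
fscore = {'temp': 120, 'month': 85, 'hum': 140}
pairs = sorted(fscore.items(), key=lambda kv: kv[1])  # ascending by count
names, counts = zip(*pairs)                           # unpack into two tuples
print(names)   # ('month', 'temp', 'hum')
```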
To find a single day to examine, filter by conditions and pick a specific index to investigate
# non-working day (workingday_0 == 1 covers weekends and holidays) in winter
sampled_at_hu = X_test[(X_test['workingday_0']==1) & (X_test['winter']==1)][:5]
sampled_at_hu
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.951895 | -2.199446 | -1.361096 | 2.214012 | -1.600161 | -0.765422 | 1.496077 |
| 364 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.466773 | -0.369869 | -0.084739 | 0.383083 | 1.588660 | 1.733465 | 0.996930 |
| 414 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | -1.177444 | -1.280727 | -0.787324 | 0.808650 | -1.310268 | 0.370436 | 1.496077 |
| 434 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.136444 | -1.207168 | -1.949511 | 0.455341 | -1.020375 | -0.651836 | 0.996930 |
| 421 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.181998 | -1.265253 | -1.530890 | 0.190365 | -1.310268 | 1.165536 | 1.496077 |
PICK = 364
Xdf[Xdf.index==PICK]
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 364 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.466773 | -0.369869 | -0.084739 | 0.383083 | 1.58866 | 1.733465 | 0.99693 |
# Review the target value: y[364] and tdf[tdf.index == 364]['cnt'].values[0] both give cnt = 2485
print(tdf[tdf.index == PICK]['cnt'].values[0])
2485
Xdf.iloc[[PICK]]
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 364 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.466773 | -0.369869 | -0.084739 | 0.383083 | 1.58866 | 1.733465 | 0.99693 |
Keep one day aside for local explanation later.
Here, we pick two representative days to explain: indices 364 and 180.
# Select one day for investigation with a force plot later
weekEnd_sample = Xdf.loc[[364]] #weekend-winter-364 #working day-winter-40
weekDay_sample = Xdf.loc[[180]] #working day-summer-180 #weekend-summer-247
weekEnd_sample_origin = X_original.loc[[364]]
weekDay_sample_origin = X_original.loc[[180]]
weekEnd_sample_origin #weekDay_sample_origin
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 364 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 16.81 | 20.70605 | 61.5833 | 14.750318 | 12.0 | 31.0 | 5.0 |
X_original.head(5)
| winter | spring | summer | fall | holiday_0 | holiday_1 | workingday_0 | workingday_1 | clear | cloudy | lightsnow.rain | year_2011 | year_2012 | temp | atemp | hum | windspeed | month | day | dayofweek | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 1.0 | 1.0 | 5.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 1.0 | 2.0 | 6.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 1.0 | 3.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 8.200000 | 10.60610 | 59.0435 | 10.739832 | 1.0 | 4.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.305237 | 11.46350 | 43.6957 | 12.522300 | 1.0 | 5.0 | 2.0 |
# activate javascript, load javascript library when visualization is not showing
shap.initjs()
# explain all the predictions in the test set
t0 = time.time()
explainer = shap.TreeExplainer(xgbmodel)
shap_values = explainer.shap_values(Xdf)
t1 = time.time()
print(t1-t0, " sec taken")
0.0890500545501709 sec taken
Note: xgbmodel holds the ensemble of trees learned from the training data.
TreeExplainer uses this tree structure to compute conditional expectations.
Pass the samples to explain to explainer.shap_values: this is often a test set, but here we use the full data set.
To summarize the workflow so far: we prepared the dataset and engineered features such as date components, temperature, and weather conditions; categorical variables were one-hot encoded and numerical features standardized. For the model we chose XGBRegressor, a strong choice for tabular regression, tuned its hyperparameters with grid search, and monitored training with an evaluation set. On the test set the model reached an error of 0.0884 (the printed "RMSE" is in fact the MSE, since mean_squared_error returns the mean of squared errors). We then examined feature importance to see which factors drive bike rental counts, and finally computed SHAP values to explain individual predictions.
A-1. Based on the model and explanation results, explain how different seasons and three weather inputs (e.g., temperature, humidity, and windspeed) would affect the bike rental counts positively and/or negatively (if none, state that there would be no influence)?
If you are the CEO of this bike rental company, what is your marketing strategy for selected seasons or weather?
selected_features = ["winter", "spring", "summer", "fall", "temp", "hum", "windspeed"]
# Find the indices of the selected features in the original DataFrame
feature_indices = [Xdf.columns.get_loc(feature) for feature in selected_features]
# Sort the SHAP values and DataFrame based on the feature indices
selected_shap_values = shap_values[:, feature_indices]
selected_Xdf = Xdf[selected_features]
# Create a SHAP summary plot for the selected features
shap.summary_plot(selected_shap_values, selected_Xdf, show=False, axis_color="black", plot_size=(20,5))
The SHAP summary plot is a graphical tool that shows how different features (such as seasons and weather inputs) influence the prediction of a machine learning model (here, bike rental counts). Each row corresponds to one feature, ordered by overall importance, and each dot is one observation. The horizontal axis shows the SHAP value, i.e., how much that feature shifts the model output for that observation, positively or negatively. The color of each dot encodes the feature's actual value (high vs. low, per SHAP's default colormap). The more spread out the dots are along a row, the more variable that feature's impact on the model output; dots clustered near zero indicate that the feature has little impact.
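The global ranking behind a summary plot is just the mean absolute SHAP value per feature. A minimal numpy sketch with a hypothetical 4x3 SHAP matrix (illustrative numbers only):

```python
import numpy as np

feature_names = ["temp", "hum", "windspeed"]  # hypothetical subset
shap_vals = np.array([[ 0.8, -0.1, 0.05],
                      [-0.6,  0.2, 0.00],
                      [ 0.9, -0.3, 0.10],
                      [-0.7,  0.1, 0.05]])    # rows = observations

mean_abs = np.abs(shap_vals).mean(axis=0)     # global importance per feature
for idx in np.argsort(-mean_abs):             # descending
    print(f"{feature_names[idx]:10s} {mean_abs[idx]:.3f}")
```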
Based on the SHAP summary plot and explanation results, here is how different seasons and three weather inputs would affect the bike rental counts:
Therefore, based on the model and explanation results, we can conclude that:
If I were the CEO of this bike rental company, I would use the following marketing strategy for selected seasons or weather:
A-2. Based on the model and explanation results, which type of days (“working days” and the “holiday/weekends”) would influence the bike rental counts positively and/or negatively (if none, state that there would be no influence)?
If you are the CEO of this bike rental company, what is your marketing strategy for the working days and holiday/weekends (select the type of day and how to improve)?
selected_features = ["holiday_0", "holiday_1", "dayofweek"]
# Find the indices of the selected features in the original DataFrame
feature_indices = [Xdf.columns.get_loc(feature) for feature in selected_features]
# Sort the SHAP values and DataFrame based on the feature indices
selected_shap_values = shap_values[:, feature_indices]
selected_Xdf = Xdf[selected_features]
# Create a SHAP summary plot for the selected features
shap.summary_plot(selected_shap_values, selected_Xdf)
# Create a Dependence plot for the "dayofweek" feature
shap.dependence_plot("dayofweek", shap_values, X_original, feature_names=NAMES, show=False, interaction_index = None, title= "Dependence Plot of Day of Week")
To assess the influence of working days, weekends, and holidays on bike rental counts, we created two informative plots: a summary plot and a dependence plot. These visualizations shed light on the impact of the "dayofweek," "holiday_0," and "holiday_1" features. Note that these are one-hot columns: holiday_0 = 1 marks a non-holiday day, while holiday_1 = 1 marks a holiday.
Summary Plot: The first plot, the SHAP summary plot, gives an overview of how these three features ("dayofweek," "holiday_0," and "holiday_1") collectively influence bike rental counts. SHAP values are on the horizontal axis, with one row per feature; each dot's SHAP value indicates the magnitude and direction of that feature's influence for one observation, and the dot color corresponds to the feature value (high vs. low, per SHAP's default colormap). This plot helps us understand the overall impact of these features on bike rental counts.
Dependence Plot: The second plot is a dependence plot specifically focused on the "dayofweek" feature. This plot offers a detailed insight into how different days of the week affect bike rental counts. The "dayofweek" values, ranging from 0 (Monday) to 6 (Sunday), are displayed on the x-axis, while the corresponding SHAP values are on the y-axis. The plot reveals the individual influence of each day and allows us to pinpoint which days have a positive, negative, or negligible impact on bike rentals.
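Reading a per-day effect off the dependence plot amounts to averaging the SHAP values of "dayofweek" within each day value. A sketch with made-up numbers (0 = Monday ... 6 = Sunday):

```python
import numpy as np

day = np.array([0, 1, 2, 5, 6, 5, 6, 0])  # hypothetical dayofweek values
sv_day = np.array([-0.1, -0.2, -0.1, 0.4, 0.5, 0.6, 0.3, -0.3])  # their SHAP values

# Average SHAP contribution per day of week
for d in range(7):
    mask = day == d
    if mask.any():
        print(d, round(sv_day[mask].mean(), 3))
```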
Based on the dependence plot (one feature: "dayofweek") and the SHAP summary plot of three features ("dayofweek", "holiday_0" and "holiday_1"), here is how different types of days would influence the bike rental counts positively and/or negatively:
Influence of Days:
Marketing Strategy: As the CEO of the bike rental company, the XAI insights guide our marketing strategy:
By tailoring marketing strategies to the specific days that impact rental counts and making data-driven decisions, the bike rental company can optimize its services and cater to customer preferences effectively.
A-3. Based on the model and results, explain the differences and similarities between casual and registered customers and how those differ based on the type of day (working day vs. weekend). If none, state that there would be no influence from different types of customers.
If you are the CEO of this bike rental company, what is your marketing strategy for different customers (casual vs. registered)?
# Here, indicate which target to predict (casual)
to_predict = 'casual'
y = df[to_predict]
# Normalize target data: Calculate mean and standard deviation
y_mu = y.mean(0)
y_sd = y.std(0)
y_norm = (y - y_mu) / y_sd
y_norm = y_norm.values.ravel() #dataframe to 1D array
# to verify, keep a whole dataframe with X and y together
tdf = pd.concat([Xdf, pd.DataFrame(y.values)], axis=1)
tdf.rename(columns={0: 'casual'}, inplace=True)  # name the target column
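Since the target is standardized before training, predictions live on the normalized scale; a small sketch of the transform and its inverse (note that pandas' `.std()` defaults to `ddof=1`, matching the code above):

```python
import pandas as pd

y = pd.Series([120.0, 331.0, 654.0, 985.0])  # hypothetical rental counts
y_mu, y_sd = y.mean(), y.std()               # .std() uses ddof=1 by default

y_norm = (y - y_mu) / y_sd                   # what the model is trained on
y_back = y_norm * y_sd + y_mu                # map predictions back to counts

print(y_back.tolist())
```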
# Split the data into training, validation, and test data:
X_train, X_test, y_train, y_test = train_test_split(Xdf, y_norm, test_size = 0.2, random_state=1)
# Validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
# A parameter grid for XGBoost
params = {
'min_child_weight': [1, 5, 10],
'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'max_depth': [3, 4, 5]
}
# Build the model with
# XGBRegressor object and GridSearch object
xgbmodel = xgb.XGBRegressor()
clf = GridSearchCV(estimator = xgbmodel,
param_grid = params,
cv = 3,
verbose=1)
# Fit the training data to the GridSearch object
t0 = time.time()
clf.fit(X_train, y_train)
t1 = time.time()
from sklearn import set_config  # public API; avoid the private sklearn._config module
set_config(print_changed_only=False)
xgbmodel = xgb.XGBRegressor(**clf.best_params_, random_state=0)
# Fit the training data to the model
xgbmodel.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_val, y_val)], verbose=0)
# Evaluate on the test set (note: mean_squared_error returns the MSE, not RMSE)
ypredicted = xgbmodel.predict(X_test)
print("MSE (casual): {0:0.4f}".format(mean_squared_error(y_test, ypredicted)))
# activate javascript, load javascript library when visualization is not showing
shap.initjs()
# explain all the predictions in the test set
explainer_casual = shap.TreeExplainer(xgbmodel)
shap_values_casual = explainer_casual.shap_values(Xdf)
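One caveat worth flagging: `mean_squared_error` returns the MSE; a true RMSE needs an explicit square root. A minimal check with toy numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
rmse = np.sqrt(mse)                       # root of that mean
print(mse, rmse)
```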
#########################################################################################################################################################
#########################################################################################################################################################
#########################################################################################################################################################
# Here, indicate which target to predict (registered)
to_predict = 'registered'
y = df[to_predict]
# Normalize target data: Calculate mean and standard deviation
y_mu = y.mean(0)
y_sd = y.std(0)
y_norm = (y - y_mu) / y_sd
y_norm = y_norm.values.ravel() #dataframe to 1D array
# to verify, keep a whole dataframe with X and y together
tdf = pd.concat([Xdf, pd.DataFrame(y.values)], axis=1)
tdf.rename(columns={0: 'registered'}, inplace=True)  # name the target column
# Split the data into training, validation, and test data:
X_train, X_test, y_train, y_test = train_test_split(Xdf, y_norm, test_size = 0.2, random_state=1)
# Validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
# A parameter grid for XGBoost
params = {
'min_child_weight': [1, 5, 10],
'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'max_depth': [3, 4, 5]
}
# Build the model with
# XGBRegressor object and GridSearch object
xgbmodel = xgb.XGBRegressor()
clf = GridSearchCV(estimator = xgbmodel,
param_grid = params,
cv = 3,
verbose=1)
# Fit the training data to the GridSearch object
t0 = time.time()
clf.fit(X_train, y_train)
t1 = time.time()
from sklearn import set_config  # public API; avoid the private sklearn._config module
set_config(print_changed_only=False)
xgbmodel = xgb.XGBRegressor(**clf.best_params_, random_state=0)
# Fit the training data to the model
xgbmodel.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_val, y_val)], verbose=0)
# Evaluate on the test set (note: mean_squared_error returns the MSE, not RMSE)
ypredicted = xgbmodel.predict(X_test)
print("MSE (registered): {0:0.4f}".format(mean_squared_error(y_test, ypredicted)))
# activate javascript, load javascript library when visualization is not showing
shap.initjs()
# explain all the predictions in the test set
explainer_registered = shap.TreeExplainer(xgbmodel)
shap_values_registered = explainer_registered.shap_values(Xdf)
Fitting 3 folds for each of 405 candidates, totalling 1215 fits
MSE (casual): 0.1744
Fitting 3 folds for each of 405 candidates, totalling 1215 fits
MSE (registered): 0.1160
# Create a figure with two subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(20, 5), facecolor='white') # Set the facecolor
# Set the background color of each subplot
for ax in axes:
ax.set_facecolor('white')
# Plot the first shap.dependence_plot on the first subplot
shap.dependence_plot(
("dayofweek"),
shap_values_casual,
X_original,
feature_names=NAMES,
show=False,
interaction_index=None,
ax=axes[0] # Use the first subplot
)
# Set the title and its text color for the first subplot
axes[0].set_title("Day Of Weeks (Casual)", color='black', style='italic')
# Plot the second shap.dependence_plot on the second subplot
shap.dependence_plot(
("dayofweek"),
shap_values_registered,
X_original,
feature_names=NAMES,
show=False,
interaction_index=None,
ax=axes[1] # Use the second subplot
)
# Set the title and its text color for the second subplot
axes[1].set_title("Day Of Weeks (Registered)", color='black', style='italic')
# Set the axis labels (x-axis and y-axis) text color
for ax in axes:
ax.xaxis.label.set_color('black')
ax.yaxis.label.set_color('black')
# Set the color of the tick labels and tick marks
ax.tick_params(axis='x', colors='black')
ax.tick_params(axis='y', colors='black')
# Add a horizontal line at y=0.0
axes[0].axhline(y=0.0, color='red', linestyle='--')
axes[1].axhline(y=0.0, color='red', linestyle='--')
# Adjust spacing between subplots
plt.tight_layout()
# Show the plot
plt.show()
To analyze the differences and similarities between casual and registered customers, particularly in their behavior on different types of days (working days vs. weekends), we trained two XGBRegressor models with separate target variables, "casual" and "registered." Their test errors were MSE (casual): 0.1744 and MSE (registered): 0.1160, respectively.
For a more detailed view of the differences by type of day, we created two dependence plots for the "dayofweek" feature, one for each target (casual and registered).
Understanding the Dependence Plot (one feature):
Here's what we observed from the dependence plots:
Casual Customers:
Registered Customers:
Based on these findings, here are potential marketing strategies for different customer segments (Casual vs. Registered):
Casual Customers:
Registered Customers:
By tailoring marketing strategies to the specific preferences and behaviors of casual and registered customers, the bike rental company can maximize its customer base and overall profitability.
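Comparing the two customer segments numerically boils down to grouping each model's "dayofweek" SHAP values by day and contrasting the means. A sketch with hypothetical values (not this notebook's actual SHAP arrays):

```python
import pandas as pd

df_cmp = pd.DataFrame({
    "dayofweek":       [0, 1, 5, 6, 0, 1, 5, 6],
    "shap_casual":     [-0.3, -0.2, 0.6, 0.5, -0.4, -0.3, 0.7, 0.4],   # hypothetical
    "shap_registered": [ 0.2,  0.3, -0.1, -0.2, 0.1, 0.2, -0.2, -0.1], # hypothetical
})

# Mean SHAP contribution per day, per segment
profile = df_cmp.groupby("dayofweek")[["shap_casual", "shap_registered"]].mean()
print(profile)
```

In this made-up profile the signs flip between segments on weekends vs. weekdays, which is exactly the kind of contrast the paired dependence plots expose visually.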
A-4. Based on the model and explanation results, explore the different seasons and atmospheric weather variables (e.g., temperature, feeling temperature, humidity, wind speed) in 2011 and 2012 that might affect the bike rental counts positively and/or negatively (if none, state that there would be no influence)?
What is your marketing strategy for the upcoming years or months if you are the CEO of this bike rental company?
# Define the selected features
selected_features1 = ["winter", "spring", "summer", "fall", "atemp", "temp", "hum", "windspeed"]
selected_features2 = ["year_2011", "year_2012"]
# Create a grid of subplots for selected_features1 vs selected_features2
num_features1 = len(selected_features1)
num_features2 = len(selected_features2)
# Calculate SHAP interaction values
shap_interaction_values = explainer.shap_interaction_values(Xdf)
# Set up the subplot grid
fig, axes = plt.subplots(num_features1, num_features2, figsize=(20, 4 * num_features1))
# Loop through selected_features1 and selected_features2
for i in range(num_features1):
for j in range(num_features2):
feature1 = selected_features1[i]
feature2 = selected_features2[j]
feature_ = (feature1, feature2)
# Create the dependence plot for the pair (feature1, feature2)
shap.dependence_plot(
feature_,
shap_interaction_values, X_original,
display_features=Xdf,
ax=axes[i, j],
show=False
)
# Set the title for the subplot
axes[i, j].set_title(f"{feature1} vs. {feature2}", style="italic", weight='bold')
# Adjust spacing
plt.tight_layout()
# Show the combined plot
plt.show()
To analyze the potential influence of different seasons, atmospheric weather variables, and years on bike rental counts and formulate a marketing strategy, we utilized SHAP (SHapley Additive exPlanations) values and dependence plots. These insights provide valuable information from an explainable artificial intelligence (XAI) perspective.
Understanding dependence plots (two features):
These dependence plots illustrate the interaction between two features. Here, the x-axis shows the feature of interest ("winter", "spring", "summer", "fall", "atemp", "temp", "hum", or "windspeed"), and the y-axis shows the SHAP interaction values, signifying the impact on bike rental counts. The color of each data point encodes the value of the second feature, the year ("year_2011" or "year_2012"), giving further insight into the relationship.
Seasonal Influence ("winter", "spring", "summer", "fall"):
Our exploration of the seasons' influence found that interpreting the SHAP values for seasonal features was particularly challenging due to the categorical nature of the features. These features exhibited limited variations in SHAP values, making it difficult to draw definitive conclusions regarding their impact on bike rental counts in both 2011 and 2012. It's important to note that categorical vs. categorical feature interactions can be complex to interpret, and the results should be considered in this context.
Atmospheric Weather Features:
Analyzing atmospheric weather features (temperature, feeling temperature, humidity, windspeed) in relation to the years 2011 and 2012 revealed some noteworthy findings. The interpretation of these plots requires careful consideration of the SHAP values:
Given the predominantly negligible impacts of these seasonal and atmospheric interaction effects on bike rentals, a marketing strategy built on these factors alone may not yield significant results. A more comprehensive strategy should focus on broader aspects, so I would design my marketing strategy for the upcoming years or months as follows:
A-5. List the top 5 pairs of features that show high interactions and explain the interactions (SHAP interaction plot).
# Calculate SHAP interaction values
shap_interaction_values = explainer.shap_interaction_values(Xdf)
# Create a SHAP summary plot
shap.summary_plot(shap_interaction_values, Xdf, feature_names=NAMES, show=False)
# Get the current figure and axes
fig, axes = plt.gcf(), plt.gcf().get_axes()
# Set the background color for all columns
fig.patch.set_facecolor('white')
for ax in axes:
ax.set_facecolor('white')
# Set the title text color
ax.set_title(ax.get_title(), color='black')
# Set the x-label and y-label text color
ax.xaxis.label.set_color('black')
ax.yaxis.label.set_color('black')
# Set the color of the tick labels and tick marks
ax.tick_params(axis='x', colors='black')
ax.tick_params(axis='y', colors='black')
# Show the plot
plt.show()
tmp = np.abs(shap_interaction_values).mean(0)
for i in range(tmp.shape[0]):
tmp[i,i] = 0
inds = np.argsort(-tmp.sum(0))[:50]
tmp2 = tmp[inds,:][:,inds]
# Set the figure size
plt.figure(figsize=(10, 10), facecolor='white')
# Create the heatmap with a different colormap
heatmap = plt.imshow(tmp2, cmap='rocket')
# Set the background
plt.gca().set_facecolor('white')
# Set the y-labels and x-labels
plt.ylabel('Features', color='black')
plt.xlabel('Features', color='black')
# Set the y-ticks and x-ticks
plt.yticks(range(tmp2.shape[0]), Xdf.columns.values[inds], rotation=50.4, horizontalalignment="right", color='black')
plt.xticks(range(tmp2.shape[0]), Xdf.columns.values[inds], rotation=50.4, horizontalalignment="left", color='black')
# Show the colorbar
cbar = plt.colorbar(heatmap)
# Set the color of colorbar ticks, labels, and scale numbers
cbar.ax.tick_params(colors='black')
# Set the color of the heatmap ticks
plt.tick_params(axis='both', colors='black')
# Set the x-axis ticks on top
plt.gca().xaxis.tick_top()
# Show the plot
plt.show()
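The same matrix behind the heatmap can be queried directly for the strongest pairs. A numpy sketch on a hypothetical symmetric interaction-strength matrix (diagonal zeroed, as above; feature names and values are illustrative):

```python
import numpy as np

names = ["temp", "hum", "atemp", "month", "year_2012"]  # hypothetical subset
M = np.array([[0.000, 0.020, 0.012, 0.003, 0.008],
              [0.020, 0.000, 0.015, 0.002, 0.018],
              [0.012, 0.015, 0.000, 0.001, 0.016],
              [0.003, 0.002, 0.001, 0.000, 0.019],
              [0.008, 0.018, 0.016, 0.019, 0.000]])

iu = np.triu_indices_from(M, k=1)  # each pair once: upper triangle only
pairs = sorted(zip(M[iu], iu[0], iu[1]), reverse=True)
for strength, i, j in pairs[:3]:
    print(f"{names[i]} x {names[j]}: {strength:.3f}")
```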
# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 3, figsize=(20, 10), facecolor='white')
# Set the background color of each subplot
for ax_row in axes:
for ax in ax_row:
ax.set_facecolor('white')
# Plot the first shap.dependence_plot on the first subplot
shap.dependence_plot(
("temp", "hum"),
shap_interaction_values, X_original,
ax=axes[0, 0], # Use the first subplot
show=False
)
axes[0, 0].set_title("Temperature vs. Humidity", color='black', style='italic', weight='bold')
axes[0, 0].xaxis.label.set_color('black') # Set x-label text color
axes[0, 0].yaxis.label.set_color('black') # Set y-label text color
axes[0, 0].tick_params(axis='x', colors='black') # Set tick label color
axes[0, 0].tick_params(axis='y', colors='black') # Set tick label color
# Plot the second shap.dependence_plot on the second subplot
shap.dependence_plot(
("hum", "year_2012"),
shap_interaction_values, X_original,
ax=axes[0, 1], # Use the second subplot
show=False
)
axes[0, 1].set_title("Humidity vs. Year 2012", color='black', style='italic', weight='bold')
axes[0, 1].xaxis.label.set_color('black')
axes[0, 1].yaxis.label.set_color('black')
axes[0, 1].tick_params(axis='x', colors='black')
axes[0, 1].tick_params(axis='y', colors='black')
# Plot the third shap.dependence_plot on the third subplot
shap.dependence_plot(
("atemp", "year_2011"),
shap_interaction_values, X_original,
ax=axes[0, 2], # Use the third subplot
show=False
)
axes[0, 2].set_title("Atemp vs. Year 2011", color='black', style='italic', weight='bold')
axes[0, 2].xaxis.label.set_color('black')
axes[0, 2].yaxis.label.set_color('black')
axes[0, 2].tick_params(axis='x', colors='black')
axes[0, 2].tick_params(axis='y', colors='black')
# Plot the fourth shap.dependence_plot on the fourth subplot
shap.dependence_plot(
("hum", "atemp"),
shap_interaction_values, X_original,
ax=axes[1, 0], # Use the fourth subplot
show=False
)
axes[1, 0].set_title("Humidity vs. Atemp", color='black', style='italic', weight='bold')
axes[1, 0].xaxis.label.set_color('black')
axes[1, 0].yaxis.label.set_color('black')
axes[1, 0].tick_params(axis='x', colors='black')
axes[1, 0].tick_params(axis='y', colors='black')
# Plot the fifth shap.dependence_plot on the fifth subplot
shap.dependence_plot(
("month", "year_2012"),
shap_interaction_values, X_original,
ax=axes[1, 2], # Use the fifth subplot
show=False
)
axes[1, 2].set_title("Month vs. Year 2012", color='black', style='italic', weight='bold')
axes[1, 2].xaxis.label.set_color('black')
axes[1, 2].yaxis.label.set_color('black')
axes[1, 2].tick_params(axis='x', colors='black')
axes[1, 2].tick_params(axis='y', colors='black')
# Remove the empty subplot in the last column
fig.delaxes(axes[1, 1])
# Access the color bars and set their color
for ax in axes.flatten():
if ax.collections:
colorbar = ax.collections[0].colorbar
colorbar.ax.yaxis.label.set_color('black') # Set color bar label text color
colorbar.ax.tick_params(axis='y', colors='black') # Set color bar tick label color
# Adjust spacing between subplots
plt.tight_layout()
# Show the plot
plt.show()
To visualize the interactions between different features, we first plotted the SHAP interaction summary plot, which shows the SHAP interaction values for each pair of features; an interaction value measures how much that pair jointly contributes to the model output, positively or negatively. Because the interaction summary plot is hard to read and analyze, we also plotted a heatmap that quantifies the mean absolute interaction strength. The heatmap is a square grid with rows and columns labeled with feature names. The color scale ranges from about 0.0000 to 0.0200, with light orange/beige marking the strongest interactions and black the weakest. The brightest cells in the heatmap therefore indicate the pairs of features that interact the most.
Based on the heatmap, we selected the top 5 pairs of features that show high interactions. These are:
To analyze these pairs of features, we plotted a dependence plot for each pair. These plots show how two features interact to affect the model output for each instance: the x-axis represents the first feature's value, the y-axis the SHAP interaction values, and the color of each data point the second feature's value. The dependence plots help us understand the relationship within each pair and how it influences the bike rental counts.
Here's what we observed from the dependence plots:
A-6. Based on the model and explanation results, analyze how the time-related features, such as 'day,' 'month,' 'year_2011,' and 'year_2012,' impact bike rental counts positively and/or negatively.
As the CEO of this bike rental company, what strategies would you implement to optimize bike rentals considering these temporal trends?
selected_features = ["day", "month", "year_2011", "year_2012"]
# Find the indices of the selected features in the original DataFrame
feature_indices = [Xdf.columns.get_loc(feature) for feature in selected_features]
# Sort the SHAP values and DataFrame based on the feature indices
selected_shap_values = shap_values[:, feature_indices]
selected_Xdf = Xdf[selected_features]
# Create a SHAP summary plot for the selected features
shap.summary_plot(selected_shap_values, selected_Xdf, show=False, axis_color="black", plot_size=(20,5))
# Create a figure with three subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
# Create the first dependence plot for "day"
shap.dependence_plot("day", shap_values, X_original, feature_names=NAMES, show=False, interaction_index=None, ax=axes[0], title="Dependence Plot of Day")
# Add a shaded line at y-axis zero
axes[0].axhline(y=0, color="red", linestyle="--", alpha=0.7)
# Create the second dependence plot for "month"
shap.dependence_plot("month", shap_values, X_original, feature_names=NAMES, show=False, interaction_index=None, ax=axes[1], title="Dependence Plot of Month")
# Add a shaded line at y-axis zero
axes[1].axhline(y=0, color="red", linestyle="--", alpha=0.7)
plt.tight_layout() # Adjust subplot spacing
plt.show() # Display the combined plot
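The monthly pattern discussed below can likewise be quantified by averaging SHAP values per month. A pandas sketch with hypothetical numbers (not this notebook's actual SHAP arrays):

```python
import pandas as pd

df_m = pd.DataFrame({
    "month":      [1, 2, 6, 7, 9, 10, 1, 7],
    "shap_month": [-0.5, -0.4, 0.3, 0.6, 0.4, 0.2, -0.6, 0.5],  # hypothetical
})

# Mean SHAP contribution of the month feature, per calendar month
monthly = df_m.groupby("month")["shap_month"].mean().sort_values()
print(monthly)
```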
To illustrate how different features affect bike rental counts, we used SHAP (SHapley Additive exPlanations) values together with several types of plots. SHAP values measure how much each feature shifts the model output, positively or negatively. We use summary plots and dependence plots to visualize the SHAP values of the features, and we draw on these insights to suggest marketing strategies for the bike rental company.
To gain a more in-depth understanding of the impacts of the "day" and "month" features on bike rental counts, we employed Dependence Plots. These plots provided a detailed examination of how these temporal features influenced our predictions, allowing for a more nuanced analysis beyond what the summary plot could reveal.
As the CEO of the bike rental company, these insights into temporal trends offer valuable guidance for optimizing rental strategies:
Leverage Yearly Trends: Given the positive impact of "Year_2012" and the negative influence of "Year_2011," focus on replicating the favorable conditions observed in 2012. Implement marketing campaigns and promotions that align with the factors contributing to increased rentals during that year.
Target Peak Days: Capitalize on the knowledge that specific days, such as the 4th through 9th, drive higher rental activity. Tailor promotional offers, events, or discounts to attract customers during these periods.
Seasonal Promotions: Recognize the seasonality in bike rentals, with months 6 through 10 being particularly favorable. Adjust pricing, marketing, and inventory strategies to cater to the higher demand during these months.
Off-Peak Strategies: During the less favorable months (January and February) and on off-peak days (days 27th to 29th), consider implementing strategies such as promotions, discounts, maintenance, and inventory management to prepare for the busier months/days ahead.
By aligning rental strategies with these temporal trends, we can enhance overall business performance and maximize bike rentals.